1 Executive Summary

This is intended to be a handy end-to-end cheatsheet for data scientists.

2 Data Manipulation

For the most commonly used data manipulation techniques in R, Python, and Hive, see data_manipulation_dict.xlsx.

3 Visualization

Here are the most commonly used visualizations:

3.1 Barchart

3.2 Histogram

3.3 Scatterplot

3.4 Line

(Example line chart: GOOG stock prices.)

3.5 Boxplot

The lines mark, from bottom to top:

  • Q1 - 1.5 * IQR (lower whisker)
  • Q1
  • Median
  • Q3
  • Q3 + 1.5 * IQR (upper whisker)
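
A quick way to check these values is to compute them directly; a minimal Python sketch with made-up data:

```python
import numpy as np

data = np.array([1, 2, 4, 4, 5, 6, 7, 8, 9, 30])  # hypothetical sample
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lower_whisker = q1 - 1.5 * iqr   # points below are drawn as outliers
upper_whisker = q3 + 1.5 * iqr
outliers = data[(data < lower_whisker) | (data > upper_whisker)]
```

Here 30 falls outside the upper whisker and would be plotted as an individual outlier point.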

3.6 Heatmap

3.7 Graph

3.8 Confidence Interval

3.9 Rank

3.10 Correlogram

3.11 Word Cloud

3.12 Datatable

3.12.1 Compact

3.12.2 Full Option

4 Statistical Inference

4.1 Probability

For an event A where \(P(A) \in [0,1]\),

\[P'(A) = 1 - P(A)\] \[P(A \cup B) = P(A) + P(B) - P(A \cap B)\] \[P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B|A)P(A)}{P(B)}\] \[P(A \cap B) = P(B|A)P(A) \]

4.2 Type I Error, Type II Error, Significance, Power and Cohen’s d

  • Type I - false positive (alpha, significance)
  • Type II - false negative (beta)
  • Power - chance of rejecting the null hypothesis when the null hypothesis is false (1 - beta)
  • Cohen’s d - effect size (rules of thumb: 0.2 small, 0.5 medium, 0.8 large)

4.3 Evaluation Metrics

  • Precision, positive predictive value - \(\frac{TP}{TP+FP}\)
  • Recall, sensitivity, true positive rate - \(\frac{TP}{TP+FN}\)
  • Specificity, true negative rate - \(\frac{TN}{TN+FP}\)
  • False discovery rate - \(1 - precision = \frac{FP}{FP+TP}\)
  • Accuracy - \(\frac{TP + TN}{ALL}\)
  • F1 - harmonic mean of precision and recall - \(\frac{2TP}{2TP+FP+FN}\), where \(HarmonicMean = \left(\frac{\frac{1}{precision} + \frac{1}{recall}}{2}\right)^{-1}\)
  • ROC - plot false positive rate (1-specificity) and true positive rate (sensitivity)
  • Kappa - accuracy normalized by class imbalance \(\frac{p_o-p_e}{1-p_e}\)
  • RMSE (l2) - \(\sqrt{\frac{\sum{(predicted-actual)^2}}{N}}\); penalizes extreme values
  • MAE/MAPE (l1) - mean absolute error/mean absolute percentage error
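
As a sanity check, the formulas above can be computed straight from a confusion matrix; a short Python sketch with hypothetical counts:

```python
TP, FP, FN, TN = 40, 10, 20, 30   # hypothetical confusion-matrix counts

precision = TP / (TP + FP)                    # positive predictive value
recall = TP / (TP + FN)                       # sensitivity, true positive rate
specificity = TN / (TN + FP)                  # true negative rate
fdr = FP / (FP + TP)                          # equals 1 - precision
accuracy = (TP + TN) / (TP + FP + FN + TN)
f1 = 2 * TP / (2 * TP + FP + FN)              # harmonic mean of precision and recall
```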

4.4 Central Tendencies

  • Population mean - \(\mu = \frac{1}{N} * \sum{X}\)

  • Sample mean - \(\bar{x} = \frac{1}{n} * \sum{x}\)

  • Population variance - \(\sigma^2 = \frac{1}{N} * \sum{(X-\mu)^2}\)

  • Sample variance - \(s^2 = \frac{1}{n-1} * \sum{(x-\bar{x})^2}\)

  • Standard error of the means - \(SEM = s/\sqrt{n}\)

  • Confidence interval - a range constructed so that, under repeated sampling, the stated percentage of such intervals contain the true parameter. \(\bar{x} \pm t*SEM\)

  • Degree of freedom - the number of values in a calculation that are free to vary: the number of independent observations less the number of intermediate parameters estimated.
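
The quantities above in a few lines of Python (illustrative data; the t quantile for df = 7 is taken from a t-table):

```python
import numpy as np

x = np.array([4.1, 5.0, 6.2, 5.5, 4.8, 5.9, 5.1, 4.4])  # hypothetical sample
n = len(x)
xbar = x.mean()                  # sample mean
s = x.std(ddof=1)                # sample sd: divides by n - 1
sem = s / np.sqrt(n)             # standard error of the mean
t = 2.365                        # two-sided 95% t quantile, df = n - 1 = 7
ci = (xbar - t * sem, xbar + t * sem)
```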

4.5 Dependence, Covariance and Correlation

  • Variance - has units

\[Var(x) = E[(x-E[x])^2]\]

  • Covariance

\[cov(X,Y) = E[(x-E(x))(y-E(y))]\]

  • Correlation - scale from -1 to 1

\[corr(x,y) = \frac{cov(X,Y)}{\sigma_x * \sigma_y}\]
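
A small Python check of the relationship between covariance and correlation, on made-up, roughly linear data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 10.0])   # roughly 2x

cov = ((x - x.mean()) * (y - y.mean())).mean()   # population covariance, has units
corr = cov / (x.std() * y.std())                 # unitless, in [-1, 1]
```

`corr` matches `np.corrcoef(x, y)[0, 1]`, and is close to 1 here because y is nearly linear in x.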

4.6 Distributions

Oftentimes with big data, we rely on the assumption that the data approximate a normal distribution. However, it helps to understand which distributions suit which types of data, since some data, such as the number of purchases per customer, clearly do not approximate normality.

4.6.1 Normal Distribution

  • Variable - continuous
  • Mean - \(\mu\)
  • SD - \(\sigma\)
  • Standard normal distribution Z(0,1)

4.6.2 Chi-square Distribution

  • Sum of squared standard normal variables
  • Variable - continuous but also used to test categorical data
  • Mean - k
  • SD - \(\sqrt{2k}\)

4.6.3 Student’s t Distribution

  • Normal-like distribution with fatter tails
  • Variable - continuous but also used to test categorical data
  • Mean - 0
  • SD - \(\sqrt{\frac{df}{df-2}}\) for df > 2

4.6.4 F Distribution

  • \(\frac{X^2_{d1}/d1}{X^2_{d2}/d2}\)
  • Variable - continuous but also used to test categorical data
  • Mean - d2/(d2-2)
  • SD - \(\sqrt{\frac{2d2^2(d1+d2-2)}{d1(d2-2)^2(d2-4)}}\)

4.6.5 Bernoulli and Binomial Distribution

  • Binomial is getting successes with probability p of n Bernoulli experiments
  • \(\binom{n}{k} * p^k * (1-p)^{n-k}\)
  • Variable - discrete often used with proportions
  • Mean - np
  • SD - \(\sqrt{np(1-p)}\)

4.6.6 Poisson Distribution

  • Number of independent, constant-rate events in an interval
  • \(\frac{\lambda^k * e^{-\lambda}}{k!}\)
  • Variable - discrete
  • Mean - \(\lambda\)
  • SD - \(\sqrt{\lambda}\)

4.6.7 Exponential Distribution

  • Time between events in a Poisson process
  • \(\lambda * e^{-\lambda x}\)
  • Variable - continuous
  • Mean - \(\frac{1}{\lambda}\)
  • SD - \(\frac{1}{\lambda}\)

4.7 Law of Large Numbers

When the number of samples/trials is large enough, the sample mean approaches the population mean. For example, in a series of dice rolls:
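
A quick Python stand-in for that example (simulated rolls; the running mean settles near the population mean of 3.5):

```python
import random

random.seed(0)
n = 100_000
total = 0
for _ in range(n):
    total += random.randint(1, 6)   # one fair-die roll
sample_mean = total / n             # approaches 3.5 as n grows
```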

4.8 Central Limit Theorem

The sample means of independent and identically distributed samples from any distribution with finite mean and variance converge to a normal distribution as the number of samples increases.

(Plots: sample means drawn from uniform, Poisson, and exponential distributions all converge toward a normal shape.)
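
A small Python simulation of the same idea: means of exponential samples (a clearly non-normal distribution) cluster symmetrically around the population mean \(1/\lambda\):

```python
import random
import statistics

random.seed(1)
lam = 2.0                      # population mean is 1 / lam = 0.5
sample_size = 50
means = [statistics.fmean(random.expovariate(lam) for _ in range(sample_size))
         for _ in range(2000)]

center = statistics.fmean(means)    # near 0.5
spread = statistics.stdev(means)    # near (1 / lam) / sqrt(sample_size)
```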

4.9 Hypothesis Testing

Testing \(\frac{\bar{x} - \mu}{SEM}\)

Goal          Continuous      Discrete         Proportions   Non-parametric
One-sample    T-test          None             Z-test        Wilcoxon test
Paired        Paired t-test   McNemar’s test   None          Wilcoxon test
Two-sample    Two-sample t    Chi^2            Z-test        Mann-Whitney U
Multi-sample  ANOVA           -                -             Kruskal-Wallis

4.9.1 Continuous

Mostly the t-test and its variations

  • One-sample

\[\frac{\bar{x} - \mu}{s/\sqrt{n}}\]

  • Two-sample

\[\frac{(\bar{x_1} - \bar{x_2}) - (\mu_1-\mu_2)}{s_p * \sqrt{\frac{1}{m} + \frac{1}{n}}}\]

\[s_p=\sqrt{\frac{(m-1)(s_{x_1})^2 + (n-1)(s_{x_2})^2}{m+n-2}}\]

  • Paired - use standard deviation of the difference between group 1 and group 2
## 
##  Paired t-test
## 
## data:  a and b
## t = 0.98041, df = 99, p-value = 0.3293
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1329513  0.3926566
## sample estimates:
## mean of the differences 
##               0.1298527

4.9.2 Discrete

  • Chi^2 test - null hypothesis is that the two variables are independent.

\[Expected = \frac{total_1}{total_{all}} * \frac{total_2}{total_{all}} * total_{all}\]

\[X^2 = \sum{\frac{(observed - expected)^2}{expected}}\]

\[df = (m-1) * (n-1)\]

where m and n are respective number of levels.

Used for product mix, matrix-wise CTR and other discrete variables in contingency tables.

##   clicked not
## 1      10  50
## 2      20  40
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  c
## X-squared = 3.6, df = 1, p-value = 0.05778
  • McNemar’s Test - paired data; e.g. before and after

\[X^2 = \frac{(b-c)^2}{b+c}\]

where b and c are the number of discordant pairs. Chi^2 distribution with degree of freedom 1.
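
A tiny Python sketch with hypothetical discordant counts; since the statistic has one degree of freedom, its tail probability can be computed via `erfc` without any stats library:

```python
from math import sqrt, erfc

b, c = 15, 5                         # hypothetical discordant pair counts
x2 = (b - c) ** 2 / (b + c)          # McNemar statistic, df = 1
p_value = erfc(sqrt(x2 / 2))         # Chi^2(1) survival function
```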

4.9.3 Proportions

With a large enough number of samples, proportions are approximately normal. Used for CTR, conversion and other proportions.

  • One sample

\[z = \frac{(p-p_0)}{\sqrt{\frac{p_0(1-p_0)}{n}}}\]

  • Two sample

\[z = \frac{(p_1-p_2)}{\sqrt{p(1-p) * (\frac{1}{n_1}+\frac{1}{n_2})}}\]

where p is the pooled proportion over both groups

## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  c(32, 54) out of c(1000, 1000)
## X-squared = 5.3583, df = 1, p-value = 0.02062
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.040754721 -0.003245279
## sample estimates:
## prop 1 prop 2 
##  0.032  0.054
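
The same two-sample test by hand in Python. Note the R output above applies a continuity correction, so its p-value differs slightly from this uncorrected z-test:

```python
from math import sqrt, erfc

x1, n1, x2, n2 = 32, 1000, 54, 1000      # successes and trials per group
p1, p2 = x1 / n1, x2 / n2
p = (x1 + x2) / (n1 + n2)                # pooled proportion over both groups
z = (p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
p_value = erfc(abs(z) / sqrt(2))         # two-sided normal tail
```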

4.9.4 ANOVA

Null hypothesis that all central values are the same. Assumes homoskedasticity and normality of outcomes.

  • N - total number of samples

  • k - number of groups

  • ni - number of members of group i

Mean Squares

  • Total Mean Squares - \(\frac{\sum{(x-\bar{x}_{all})^2}}{N-1}\)

  • Between-group Mean Squares - \(\frac{\sum{(\bar{x}_{group}-\bar{x}_{all})^2}}{k-1}\)

  • Within-group Mean Squares - \(\frac{\sum{(x-\bar{x}_{group})^2}}{N-k}\)

\(F = \frac{\text{Between-group Mean Squares}}{\text{Within-group Mean Squares}}\)
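
The F statistic computed from its definition in Python, on three hypothetical groups:

```python
import numpy as np

groups = [np.array([5.1, 4.9, 5.4]),
          np.array([6.0, 6.3, 5.8]),
          np.array([7.1, 6.8, 7.0])]   # hypothetical samples
all_x = np.concatenate(groups)
N, k = len(all_x), len(groups)

# sums of squares taken over every observation
ss_between = sum(len(g) * (g.mean() - all_x.mean()) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
F = (ss_between / (k - 1)) / (ss_within / (N - k))   # large F => group means differ
```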

4.9.5 P-value Controversy

The p-value is the probability of observing data at least as extreme as what was found, given the null hypothesis.

What most people think of it as:

\(P(Null|Data) = \frac{P(Null) * P(Data|Null)}{P(Data)} = \frac{P(Null) * P(Data|Null)}{P(Null) * P(Data|Null) + P(!Null) * P(Data|!Null)}\)

What it really is:

\(P(Data|Null)\)

Example: out of 100 samples with 0.1 prevalence, 0.8 power, and 0.05 significance, we detect 8 of the 10 true positives plus about 5 false positives (5% of the 90 true nulls). Precision is 8/13 ≈ 61.5%, not 95%.

\(\frac{0.1*0.8}{(0.1*0.8)+(0.05*0.9)}\)

4.9.6 Correction for Multiple Experiments

The more experiments k we run, the more likely we are to see at least one false positive. The chance of zero false positives across k independent tests is

\(P(\text{no false positives}) = (1-a)^k\)

  • Bonferroni corrections - most conservative

\[a_c = \frac{a}{k}\] where k is number of experiments

  • Sidak corrections - assumes statistical independence among tests

\[ a = 1 - (1-a_c)^k \] \[ a_c = 1 - (1-a)^{1/k} \]

  • Benjamini-Hochberg procedure

  • Order the m hypotheses by ascending p-value on a rank-p plane

  • Draw the line \(p = \frac{a}{m} * rank\)

  • Find the largest rank whose p-value still falls below the line; reject all hypotheses up to that rank

  • This controls the false discovery rate at \(FDR < a\)
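
The procedure in plain Python (illustrative p-values; the function returns the indices of the rejected hypotheses):

```python
def benjamini_hochberg(pvals, a=0.05):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # indices by ascending p
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= a * rank / m:
            cutoff = rank                              # largest rank under the line
    return {order[r] for r in range(cutoff)}           # rejected hypothesis indices

rejected = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.74])
```

Here only the first two hypotheses fall below the line, even though 0.039-0.042 would pass an uncorrected 0.05 threshold.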

4.10 Model Selection

  • Likelihood ratio - Chi-squared; similar to F-test

\[ LR = \frac{TPR}{FPR} = \frac{sensitivity}{1-specificity} \]

\[ \lambda = \frac{LR_{fitModel}}{LR_{reducedModel}} \]

\[LL = -2log\lambda\]

  • Akaike Information Criterion

AIC penalizes the number of features.

\[AIC = LL + 2k\]

where k is the number of features

  • Bayesian Information Criterion

Stricter than AIC by log(n)

\[BIC = LL + klog(n)\]

where n is the number of examples.

5 Preprocessing

5.1 Standardization

Transform features to resemble the standard normal Z(0,1).

\[ x_{standardized} = \frac{x - \overline{x}}{s_x} \]

5.2 Min-max Normalization

\[ x_{normalized} = \frac{x - min(x)}{max(x) - min(x)} \]
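
Both rescalings in Python, on illustrative values:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

standardized = (x - x.mean()) / x.std(ddof=1)       # ~ Z(0, 1)
normalized = (x - x.min()) / (x.max() - x.min())    # into [0, 1]
```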

5.3 Imbalanced Classes

We handle imbalanced classes with:

5.3.1 Undersampling/Oversampling/Both

  • Undersampling - sample the majority class down to the size of the minority class
  • Oversampling - sample the minority class with replacement up to the size of the majority class
  • Both - resample both classes toward a common size

## 
##   0   1 
## 980  20

## 
##  0  1 
## 20 20

## 
##   0   1 
## 980 978

## 
##   0   1 
## 500 500

5.3.2 SMOTE

Synthetic Minority Over-sampling Technique: the algorithm selects k nearest neighbors (using a distance measure) and perturbs an instance one attribute at a time by a random amount within the difference to its neighboring instances.

## 
##    0    1 
## 2000 1020

5.3.3 Tree-based Models

Tree-based models handle class imbalance comparatively well because they split on information criteria instead.

5.4 Dimension Reduction

5.4.1 SVD

Truncated SVD used to reduce dimensions.

\[ M = U \Sigma V^T \]

where, for an m x n matrix M, the dimensions of U, \(\Sigma\) and \(V^T\) are m x m, m x n and n x n respectively.

5.4.2 PCA

  • Standardize features

  • Compute covariance matrix \((1/m) * X'X\)

  • Perform SVD; Find eigenvector U of X’X

\[ A = U * S * V' \]

  • \(Z = U'X\) - [k x n] * [n x 1] = [k x 1]
## Standard deviations:
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
## 
## Rotation:
##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432

6 Loss Functions

6.1 Hinge Loss

Maximizes the margin between two classes with hyperparameter delta

6.1.1 Forward

For an example:

Loss per class per example is positive unless the score of the correct class exceeds the score of the wrong class by at least delta. \[L_{i} = \sum_{j \neq k} max(0,f_{j} - f_{k} + \Delta)\]

6.1.2 Backward

Derivative of loss function by weight \[\frac{dL_{i}}{dW} = \frac{dL_{i}}{df_{j-or-k}} * \frac{df_{j-or-k}}{dW} = \frac{dL_{i}}{df_{j-or-k}} * X_{i}\]

For the correct class, minus the number of margin violations \[\frac{dL_{i}}{df_{k}} = -\Sigma 1(f_{j} - f_{k} + \Delta > 0)\]

For each wrong class, one whenever it violates the margin \[\frac{dL_{i}}{df_{j}} = 1(f_{j} - f_{k} + \Delta > 0)\]

6.2 Cross Entropy Loss

6.2.1 Forward

For an example: \[L_{i} = -log(p_{k})\] \[p_{k} = \frac{e^{f_{k}}}{\Sigma e^{f_{j}}}\] \[f = X_{i}W + b\] where i = example; k = correct class; j = any class

For regularization term: \[l2 = \frac{\lambda}{2}W^2\] \[l1 = \frac{\lambda}{2}|W|\]

Overall Loss: \[L = \frac{\Sigma L_{i}}{N} + l1/l2\]

6.2.2 Backward

Derivative of loss function by weight \[\frac{dL_{i}}{dW} = \frac{dL_{i}}{df_{k}} * \frac{df_{k}}{dW} = \frac{dL_{i}}{df_{k}} * X_{i}\] \[\frac{dL_{i}}{df_{k}} = \frac{-dlog(p_{k})}{df_{k}} = \frac{-1}{p_{k}}\frac{dp_{k}}{df_{k}} = \frac{-1}{p_{k}}\frac{d(\frac{e^{f_{k}}}{\Sigma e^{f_{j}}})}{df_{k}}\] \[\frac{d(\frac{e^{f_{k}}}{\Sigma e^{f_{j}}})}{df_{k}} = \frac{\Sigma e^{f_{j}}e^{f_{k}} - e^{2f_{k}}}{(\Sigma e^{f_{j}})^2} = p_{k} - p_{k}^2 = p_{k}(1-p_{k})\] \[\frac{dL_{i}}{df_{k}} = p_{k} - 1(y_{i}=k)\] \[\frac{dL_{i}}{dW} = [p_{k} - 1(y_{i}=k)] * X_{i}\]
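
The forward pass and the gradient \(p_{k} - 1(y_{i}=k)\) can be verified numerically; a minimal Python sketch with made-up scores:

```python
import math

def softmax_ce(f, k):
    """Cross-entropy loss and gradient dL/df for one example, correct class k."""
    m = max(f)
    exps = [math.exp(v - m) for v in f]   # shift by max for numerical stability
    total = sum(exps)
    p = [e / total for e in exps]
    loss = -math.log(p[k])
    grad = [pj - (1 if j == k else 0) for j, pj in enumerate(p)]
    return loss, grad

f = [2.0, 1.0, 0.1]                        # hypothetical class scores
loss, grad = softmax_ce(f, k=0)

# finite-difference check of dL/df_0
eps = 1e-6
loss_plus, _ = softmax_ce([f[0] + eps] + f[1:], k=0)
numeric = (loss_plus - loss) / eps
```

The gradient components also sum to zero, since the probabilities sum to one.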

6.3 L1 and L2 Loss

6.3.1 Forward

\[L_i = \Sigma (y - f_j)^2 \quad (l2)\]

\[L_i = \Sigma |y - f_j| \quad (l1)\]

6.3.2 Backward

\[\frac{dL_{i}}{dW} = \frac{dL_{i}}{df_j} * \frac{df_j}{dW} = \frac{dL_{i}}{df_j} * X_{i}\] \[\frac{dL_{i}}{dW} = -2(y-f_j) * X_i\]

7 Information Measures

Let \(p_i\) be the percentage of examples labelled as i

7.1 Gini impurity

The higher the impurity, the less homogeneous the node.

\[I = \Sigma p_i(1-p_i)\]
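
Gini impurity in Python: 0 for a pure node, maximal at 0.5 for an even two-class split (toy labels):

```python
from collections import Counter

def gini_impurity(labels):
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

pure = gini_impurity(["a"] * 10)               # homogeneous node
mixed = gini_impurity(["a"] * 5 + ["b"] * 5)   # even two-class split
```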

7.2 Information gain

Difference between entropy of parent and child

\[IG = H(T) - H(T|a)\]

\[H(T) = - \Sigma p_ilog_2p_i\]

7.3 Variance reduction

l2 loss reduction, used in regression trees.

8 Distances

\[cosineSimilarity = \frac{A \cdot B}{||A||_{2}||B||_{2}} = \frac{\Sigma A_i B_i}{\sqrt{\Sigma A_i^2 \Sigma B_i^2}}\]

\[JaccardSimilarity = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}\]

\[Manhattan(l1) = \Sigma|A_i-B_i|\]

\[Euclidean(l2) =\sqrt{\Sigma(A_i-B_i)^2}\]
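
The four measures above in plain Python (toy vectors; note that B = 2A gives cosine similarity 1 even though the vectors differ in magnitude):

```python
from math import sqrt

A = [1.0, 2.0, 3.0]
B = [2.0, 4.0, 6.0]      # same direction as A, twice the magnitude

cosine = sum(a * b for a, b in zip(A, B)) / (
    sqrt(sum(a * a for a in A)) * sqrt(sum(b * b for b in B)))
manhattan = sum(abs(a - b) for a, b in zip(A, B))          # l1
euclidean = sqrt(sum((a - b) ** 2 for a, b in zip(A, B)))  # l2

S, T = {1, 2, 3}, {2, 3, 4}   # Jaccard compares sets
jaccard = len(S & T) / len(S | T)
```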

9 Algorithms

Plot training-set size vs. error, or polynomial degree vs. error, for both the training and validation sets

  • More training examples - fix high variance
  • Smaller features - fix high variance
  • More features - fix high bias
  • Polynomial - fix high bias
  • Smaller \(\lambda\), bigger C - fix high bias
  • Bigger \(\lambda\), smaller C - fix high variance

9.1 Unsupervised

  • isotropic: same in all directions (essentially standardized); not elongated clusters
  • convex: any line through the shape crosses the boundary at most twice; not twisted-shape clusters

9.1.1 K-mean Clustering

The implementation can be divided into the following:

  • Handle Data: Clean the file, normalize the parameters, give numeric values to non-numeric attributes, read the data from the file, and split the data for cross validation.
  • Find Initial Centroids: Choose k centroids at random.
  • Distance Calculation: Find the distance between each of the data points and each of the centroids. This distance metric determines which cluster each point belongs to.
  • Re-calculate the Centroids: Find the new values for each centroid.
  • Stop the Iteration: Stop when the difference between the old and the new centroids is negligible.
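
The steps above can be sketched in a few lines of Python (toy 2-D data with two obvious groups; illustrative only):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # initial centroids at random
    for _ in range(iters):
        # assign each point to its nearest centroid (squared Euclidean)
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        # recompute each centroid as the mean of its cluster
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:           # stop when centroids settle
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),    # one tight group near (0, 0)
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]    # another near (5, 5)
centroids, clusters = kmeans(points, k=2)
```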

Notable features:

  • Assumes equal variances within groups

  • For convex and isotropic clusters

  • Requires standardization

  • Performs poorly on non-convex, elongated, or unequal-variance clusters

9.1.2 Hierarchical Clustering

Builds the hierarchy from the individual elements by progressively merging clusters. It minimizes:

  • Ward: sum of squared differences
  • Complete linkage: maximum distance
  • Average linkage: average distance

9.2 Supervised

9.2.1 Linear Regression

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

9.2.1.1 OLS Assumptions

  • BLUE - best linear unbiased estimator
  • iid - independent and identically distributed

  • Linearity - residual/observed vs predicted plot
  • Normality of errors - QQ plot, theoretical vs data quantile
  • No multicollinearity - Variance inflation factor
  • No autocorrelation of errors - Durbin-Watson test, residual by time/index
  • Homoskedasticity of errors - Breusch–Pagan test; constant error variance across fitted values

9.2.1.2 Diagnostics

  • F-stat - model specification
  • Adjusted R-squared - inflation-adjusted variation explained
  • t-stat - coefficient significance

9.2.2 Decision Tree

Perform splits on each node according to information measures.

9.2.3 K-nearest Neighbor

Nearest k points take a vote according to distance

  • Small k - fast, more efficient, more strict
  • Large k - biased towards popular labels, ignores outliers

9.2.4 SVM

  • Hinge loss
  • Linear classifier
  • Maximize margin
  • Kernel tricks aka substituting the score function: polynomial, radial, tanh
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 1000 samples
##    6 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 1000, 1000, 1000, 1000, 1000, 1000, ... 
## Resampling results across tuning parameters:
## 
##   C     RMSE       Rsquared 
##   0.25  0.4628845  0.8889003
##   0.50  0.4638732  0.8868725
##   1.00  0.4716239  0.8825424
## 
## Tuning parameter 'sigma' was held constant at a value of 0.04650196
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were sigma = 0.04650196 and C = 0.25.

9.2.5 Softmax

Multi-class generalization of logistic regression.

## 
## Call:
## glm(formula = vs ~ wt + gear, family = "binomial", data = mtcars)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6425  -0.7830  -0.1126   0.5729   1.7588  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  10.9370     5.4987   1.989   0.0467 *
## wt           -2.5077     0.9866  -2.542   0.0110 *
## gear         -0.9044     0.8124  -1.113   0.2656  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.860  on 31  degrees of freedom
## Residual deviance: 29.944  on 29  degrees of freedom
## AIC: 35.944
## 
## Number of Fisher Scoring iterations: 5

9.2.6 Naive Bayes

\[P(spam|word) = \frac{P(word|spam) * P(spam)}{P(word)}\] \[P(spam|word)= \frac{P(word1, word2,word3|spam) * P(spam)}{P(word)}\] \[P(spam|word)= \frac{P(word1|spam) * P(word2,word3|spam,word1) * P(spam)}{P(word)}\] It’s naive because it assumes the words appear independently of each other given the class: \[P(spam|word) = \frac{P(word1|spam) * P(word2|spam)* P(word3|spam) * P(spam)}{P(word)}\]

## Naive Bayes 
## 
## 150 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 150, 150, 150, 150, 150, 150, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa    
##   FALSE      0.9525666  0.9281677
##    TRUE      0.9491480  0.9229907
## 
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were fL = 0, usekernel = FALSE
##  and adjust = 1.

9.2.7 Collaborative Filtering

Matrix multiplication based on similarity matrix or embedded vectors.

9.2.8 Ensemble

9.2.8.1 Gradient Boosting

Fit h(x), the model-improvement term, to the residuals of the previous model to improve it.

\(h(x) = y - F_m\)

\(F_{m+1} = F_m + v * h(x)\)

v = shrinkage (learning rate)
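
A toy version of this loop in Python, using depth-1 stumps as h(x) on 1-D data (illustrative only, squared loss):

```python
def fit_stump(x, r):
    """Best single split minimizing squared error of the residuals r."""
    best = None
    for s in x:
        left = [ri for xi, ri in zip(x, r) if xi <= s]
        right = [ri for xi, ri in zip(x, r) if xi > s]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((ri - (lm if xi <= s else rm)) ** 2 for xi, ri in zip(x, r))
        if best is None or err < best[0]:
            best = (err, s, lm, rm)
    _, s, lm, rm = best
    return lambda xi: lm if xi <= s else rm

x = [1, 2, 3, 4, 5, 6]                       # hypothetical inputs
y = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9]           # hypothetical targets
v = 0.5                                      # shrinkage (learning rate)
F = [0.0] * len(x)                           # F_0
for m in range(20):
    residuals = [yi - fi for yi, fi in zip(y, F)]   # h targets the residuals
    h = fit_stump(x, residuals)
    F = [fi + v * h(xi) for fi, xi in zip(F, x)]    # F_{m+1} = F_m + v * h

mse = sum((yi - fi) ** 2 for yi, fi in zip(y, F)) / len(y)
```

Each round shrinks the residuals further, so the training MSE drops well below the variance of y.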

9.2.8.2 Random Forest

  • draw bootstrap samples
  • train a decision tree on each; grow each tree fully, selecting m random attributes from the p available at each split
  • split on the gini impurity criterion
  • measure out-of-bag error
    • evaluate against the samples that were left out
    • provides
      • a measure of strength (inverse error rate)
      • correlation between trees
      • variable importance: if you scramble a variable’s values and the accuracy doesn’t change much, the variable is not important; useful for interpretation
  • make decisions by vote of the k trees
  • easy to parallelize
  • handles “small n big p” naturally

9.2.9 Markov Chain

Prediction procedure:

  1. Separate network-event chains into those that become NC and those that do not.
  2. Create two Markov chain models and transition matrices.
  3. To predict, construct the log-odds of a given chain as follows:

\[logOdds_{j} = log(\frac{prob_{positive}}{prob_{negative}})\] \[logOdds_{i} = \Sigma logOdds_{j}\] \[P(NC) = \frac{e^{logOdds_{i}}}{1+e^{logOdds_{i}}}\] where i is an example and j is a transition within the example.

10 Neural Networks

10.1 Layers

10.1.1 ReLU

\(max(0,f_j)\)

10.1.2 Convolution

A correlation filter rotated 180 degrees.

10.1.3 Max Pooling

Reduce resolution by choosing max value.

10.1.4 LSTM

Has an additive memory cell to prevent vanishing gradients. Let \(s_t\) be the score.

10.1.4.1 Gates

Forget gate: how much of old information to retain

\[f_t = \sigma_{sigmoid}(s_t + U_fh_{t-1})\]

Information gate: how much of new information to receive

\[i_t = \sigma_{sigmoid}(s_t + U_ih_{t-1})\]

Output gate

\[o_t = \sigma_{sigmoid}(s_t + U_oh_{t-1})\]

10.1.4.2 Cell State

This is where the magic happens.

\[c_t = f_t \circ c_{t-1} + i_t \circ \sigma_{tanh}(s_t + U_ch_{t-1})\]

10.1.4.3 Output

\[h_t = o_t \circ \sigma_{tanh}(c_t)\]

10.1.5 Dropout

Randomly drop activations during training to reduce overfitting.

10.1.6 Batch Normalization

Normalize activations to Z(0,1)

10.1.7 Embedding

Map integers to a dense vector space

10.1.8 Softmax

10.2 Gradient Learning

  • Vanilla
  • Stochastic: one by one or by batch
  • Momentum: accelerate in relevant directions; incorporate decaying average of past gradients

\[\frac{dL_{i}}{dW} = \alpha * \frac{dL_{i}}{dW} + \beta * \frac{dL_{i-1}}{dW}\]

  • adagrad: adaptive learning rate per parameter; divide by the square root of the sum of past squared gradients

\[learn_{\beta} = \frac{learn_{base}}{\sqrt{\Sigma (\frac{dL}{d\beta})^2}}\]

  • rmsprop: adaptive learning rate per parameter; exponentially decaying average of past squared gradient

\[learn_{\beta} = \frac{learn_{base}}{\gamma * rms_{current} + \nu * rms_{previous}}\]

  • adam: rmsprop + momentum
  • eve: anneal the learning rate based on recent volatility of the loss; divide it by \(d_t\)

\[ r_{t} = \frac{|loss_{t} - loss_{t-1}|}{max(loss_{t},loss_{t-1})}\] \[ d_t = \gamma * r_{t} + \nu * r_{t-1} \]

  • adam_ann: eve with average sum of squares of gradients instead of loss